Overview

Acute kidney injury (AKI) is one of the most common complications seen in hospitalized patients. The ability to predict which patients are at highest risk for AKI would allow clinicians to intervene and prevent poor patient outcomes. Despite many attempts, however, AKI prediction models fail to perform in a way that motivates their use in clinical decision support. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on ICU status and admission type, can inform future studies seeking to predict AKI. This study aims to quantify the predictive power of various risk factors for AKI based on admission type and patient location in the hospital.

Introduction

AKI is a common condition in hospitalized patients, with rates estimated to be as high as 7% in hospitalized patients and 30% in ICU patients (Goyal et al.). AKI is a sudden loss of renal function due to non-renal causes, such as dehydration, reduced blood flow to the kidney, nephrotoxic medications, or sepsis. AKI is usually reversible, but if not addressed it can progress to permanent kidney damage and multi-organ failure. Diagnosing AKI early can prevent poor patient outcomes (Goyal et al.). Many groups have attempted to build decision-making tools to assist clinicians in identifying AKI early in the disease course, allowing rapid intervention and preventing sequelae. However, the majority of AKI prediction tools, including many using state-of-the-art machine learning methods on large, diverse datasets, struggle to improve on a trained clinician's ability to identify AKI (De Vlieger et al.). In particular, the black-box nature of machine learning algorithms limits a clinician's ability to trust the models, understand what is happening to the patient, and base decisions on model outputs. One way to enhance model transparency while improving performance is to understand which risk factors are important to model decision making. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on ICU status and admission type, can inform future work seeking to predict AKI.

Prior research has built a dataset of patients on a combination antihypertensive with either an NSAID or oxycodone to assess AKI risk (Miano et al.). This dataset can be re-analyzed to identify which of the studied risk factors contributes most to prediction of AKI. Patients on combination antihypertensives and either NSAIDs or oxycodone represent a common subset of hospitalized patients, so results on this population can be expected to generalize well to all patients. This study seeks to answer the question: in patients receiving treatment for hypertension and pain, do the most predictive risk factors for AKI differ by ICU vs non-ICU status and admission type? A secondary question explored by this study is: do models perform better when continuous variables are treated as categorical variables?

AKI prediction is a prominent interdisciplinary space. Clinicians, informaticians, computer scientists, and epidemiologists frequently work together and publish in this space. The work typically combines new machine learning approaches with classic clinical definitions of kidney injury and risk factors. No discipline alone has enough expertise to address the problem, making interdisciplinary collaboration key to success. Conversations with Todd Miano (PharmD, PhD, researcher at the University of Pennsylvania's Department of Biostatistics, Epidemiology, and Bioinformatics (DBEI) and critical care pharmacist) highlighted the difficulty of defining AKI through KDIGO's multipart definition, which is based on the non-specific marker serum creatinine. Informatician John Holmes (Professor of Medical Bioinformatics in Epidemiology, also at the University of Pennsylvania DBEI) emphasized that the various prediction models incorporate similar, large sets of risk factors. He suggested that different risk factors may play varying roles for patients in different clinical settings, such as patients in the ICU or post-operative patients. These discussions motivated the proposed study: quantifying the predictive power of various risk factors, with the analysis split by admission type and patient location.

A git repo for this project is available here.

Methods

Data

Todd Miano provided the data used in this project, which were previously described in Miano et al. Briefly, data were collected on 27,741 adult patients with greater than 24 hours of exposure to both an anti-hypertensive agent and an analgesic. Anti-hypertensives of interest included either a renin-angiotensin inhibitor (RAS-I) or the calcium channel blocker amlodipine. Analgesics of interest included either an NSAID or oxycodone. Data were collected between 2014 and 2017 from patients admitted to University of Pennsylvania Health System hospitals, using the Penn Data Store warehouse for electronic health records in the health system. Patients were excluded if they had contraindications to NSAIDs, including unresolved AKI within 2 weeks before entry, baseline serum creatinine >2 mg/dl (an indicator of renal injury), end-stage renal disease, renal replacement therapy, platelet count \(< 100*10^{11}\) (risk of bleeding), pregnancy (a contraindication to RAS-I treatment), lack of baseline or follow-up serum creatinine, or history of solid organ transplant. Information was also collected on variables associated with AKI, including cardiovascular conditions, diabetes, and cancer. Laboratory values and presence of nephrotoxic medications at entry are also included. In this dataset, AKIs and AKI severity were labeled according to Kidney Disease: Improving Global Outcomes (KDIGO) criteria.

In this dataset, comorbid conditions and laboratory values were assessed at the point in time when the patient qualified for the study (entry), that is, when the patient had accumulated greater than 24 hours of exposure to the drug combinations of interest. Thus, when using these data to predict AKI, we are necessarily predicting AKI at that same point in time. The prediction methods evaluated in this study therefore attempt to predict AKI using information available to clinicians up to 24 hours after initiating combination anti-hypertensive and analgesic therapy. The goal of this research is to identify which predictors are most helpful in predicting AKI in this setting, and how they differ, if at all, between the medical floor and the ICU.

To do this, the dataset was imported and examined. There were no missing data. However, there were abnormally high and low BMI values, so subjects with BMI <12 or >60 were removed. Next, the data were separated into 1) the complete dataset, 2) the ICU cohort, and 3) the medical floor cohort.

A second dataset, with continuous variables categorized by reference ranges, was created. The continuous variables BMI, white blood cell count, hemoglobin, platelets, sodium, potassium, chloride, creatinine, and index GFR were categorized based on reference ranges (CDC; Kratz 2022; UPenn; KDIGO). Age was categorized as 18-40, 41-65, and greater than 65. Prior length of stay was grouped as zero days, 1-2 days, 3-7 days, and greater than seven days. Age and prior length of stay ranges were chosen arbitrarily.

Logistic Regression Analysis

Logistic regression models (binomial GLMs) were created for each data subset. K-fold cross-validation was used to estimate prediction error: K = 10 subsamples were randomly created, with K-1 = 9 subsets used for training and the remaining subset used for testing. Areas under the ROC and precision-recall curves were generated to evaluate model performance.

Predictor importance for the logistic regression models was identified based on test-statistic magnitude and statistical significance.

Random Forest Analysis

Next, random forest models were trained for each data subset. K-fold cross-validation was again used to estimate prediction error: K = 10 subsamples were randomly created, with K-1 = 9 subsets used for training and the remaining subset used for testing. Each forest was built with 100 trees. Areas under the ROC and precision-recall curves were generated to evaluate model performance.

Predictor importance was identified based on the magnitude of the mean decrease in GINI importance score. Importantly, mean decrease in GINI importance scores are known to be inflated for continuous variables and for categorical variables with many categories. When calculating GINI scores, the GINI index is calculated at each candidate cut-point in a classification tree. Variables that are continuous or have many categories have many more potential cut-points than dichotomous variables, and therefore a higher probability of coincidentally producing a better cut-point and receiving a higher importance score (Nembrini, 2018; Strobl, 2007). To evaluate the effect of continuous variables on mean decrease in GINI scores in this study, the analysis was repeated using the dataset with all continuous variables categorized according to reference ranges (see above).
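To make the cut-point argument concrete, the following sketch (using simulated, hypothetical data, not the study dataset) counts candidate cut-points for a continuous versus a dichotomous predictor; a tree has many more opportunities to find a coincidentally good split on the continuous one.

```r
# Illustrative sketch with simulated data: a continuous predictor offers far
# more candidate cut-points than a binary one, inflating its chance of a
# coincidentally good split (and thus its GINI-based importance).
set.seed(1)
continuous <- runif(1000)          # e.g., a lab value
binary     <- rbinom(1000, 1, 0.5) # e.g., a comorbidity flag

# A classification tree can split between any two adjacent unique values
n_cutpoints <- function(x) length(unique(x)) - 1

n_cutpoints(continuous)  # 999 candidate splits
n_cutpoints(binary)      # 1 candidate split
```

Categorizing the continuous variables into a handful of reference-range bins, as done below, collapses these hundreds of candidate splits down to just a few.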

Install and Load Packages

# Install packages, if needed
install.packages("dplyr")
install.packages("gtsummary")
install.packages("modelsummary")
install.packages("ggplot2")
install.packages("randomForest")
install.packages("tibble")
install.packages("tidyr")
install.packages("stringr")
install.packages("pROC")
install.packages("PRROC")
# Load packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtsummary)
library(modelsummary)
library(ggplot2)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tibble)
library(tidyr)
library(stringr)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(PRROC)

Read in the Data

Data were uploaded to Zenodo and are available here.

#Load the data
data <- read.csv("/Users/haedi/Repos/BMIN503_Final_Project/BMIN503_Final_Project/miano_thelen_dataset.csv", header = TRUE)

# Drop columns not needed:
# ddiGroup was needed to classify exposure in the prior study, but not needed
# here; nsaidType is missing for 81% of patients and so was left out. 
# aki_stage, rrt, and mortHosp30 are outcome variables needed in the prior study
# but not here; we are only interested in binary AKI as the outcome variable. 

# Table 1, sorted by ICU stay, variables renamed for readability
data %>%
  select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30)) %>%
  filter(bmi>12 & bmi<60) %>%
  rename(Analgesic = pain, Antihypertensive = bp, AKI = aki, Age = age, 
         Sex = sex, Race = race, BMI = bmi,  "Admission Type" = admType, 
         "Prior Length of Stay" = priorLos, "Perioperative Day" = periOp, 
         Ventilator = vent, "Congestive Heart Failure" = chf, 
         "Chronic Pulmonary Disease" = cpd, 
         HIV = hiv, "Liver Disease" = liver, "Peripheral Vascular Disease" = pvd,
         "Cerebrovascular Disease" = cva, "Myocardial Infarction" = mif, 
         "Valvular Disease" = valve, Hypertension = htn, 
         "Cardiac Arrhythmias" = arry, 
         "Pulmonary Circulation Disorder" = pCirc, Obesity = obese, 
         "Weight Loss" = wtLoss,
         "Fluid and Electrolyte Disorder"= fluid, "Chronic Kidney Disease" = ckd,
         "Atrial Fibrillation" = afib, "Obstructive Sleep Apnea" = osa, 
         "Diabetes Mellitus" = dm, Cancer = cancer, "Prior AKI" = prior_aki,
         Vancomycin = vanco,
         "Bactrim" = bactrim, "Vasopressors" = pressor, 
         "Other Nephrotoxins" = ntxOther, 
         "Nephrotoxic Antibiotics" = abxNTX, "Diuretics" = diuretic, 
         "Broad Spectrum Antibiotics" = gramNegBroad, 
         "Narrow Spectrum Antibiotics" = gramNegNarrow,
         "WBC x10^9cells/L" = wbc, "Hemoglobin g/dl" = hgb, 
         "Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium, 
         "Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride,
         "Serum Creatinine at entry" = creatinine, "Baseline GFR" = indexGFR) %>%
   mutate(Cancer = factor(Cancer, levels = c(0,1,2), 
                          labels = c("None", "Non-metastatic", "Metastatic"))) %>%
  mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
  tbl_summary(by = icu)
Characteristic non-ICU, N = 23,031¹ ICU, N = 4,485¹
Analgesic
    NSAID 4,667 (20%) 704 (16%)
    Oxycodone 18,364 (80%) 3,781 (84%)
Antihypertensive
    CCB 4,792 (21%) 1,045 (23%)
    RAS 18,239 (79%) 3,440 (77%)
AKI 1,867 (8.1%) 585 (13%)
Age 64 (55, 72) 63 (54, 71)
Sex
    Female 11,304 (49%) 1,694 (38%)
    Male 11,727 (51%) 2,791 (62%)
Race
    Black 8,916 (39%) 1,125 (25%)
    Other / Unk 2,145 (9.3%) 575 (13%)
    White 11,970 (52%) 2,785 (62%)
BMI 30 (26, 35) 29 (25, 34)
Admission Type
    Medicine 9,335 (41%) 1,039 (23%)
    Surgery 13,696 (59%) 3,446 (77%)
Prior Length of Stay 2 (1, 3) 2 (1, 4)
Perioperative Day
    Not post-op 17,474 (76%) 2,860 (64%)
    POD one 2,503 (11%) 978 (22%)
    POD three 736 (3.2%) 173 (3.9%)
    POD two 1,907 (8.3%) 371 (8.3%)
    POD zero 411 (1.8%) 103 (2.3%)
Ventilator 340 (1.5%) 682 (15%)
Congestive Heart Failure 5,696 (25%) 1,799 (40%)
Chronic Pulmonary Disease 6,411 (28%) 1,478 (33%)
HIV 347 (1.5%) 37 (0.8%)
Liver Disease 1,184 (5.1%) 212 (4.7%)
Peripheral Vascular Disease 3,429 (15%) 1,092 (24%)
Cerebrovascular Disease 2,259 (9.8%) 646 (14%)
Myocardial Infarction 3,251 (14%) 1,206 (27%)
Valvular Disease 3,476 (15%) 1,232 (27%)
Hypertension 20,664 (90%) 3,777 (84%)
Cardiac Arrhythmias 4,477 (19%) 1,359 (30%)
Pulmonary Circulation Disorder 2,078 (9.0%) 640 (14%)
Obesity 5,161 (22%) 871 (19%)
Weight Loss 1,418 (6.2%) 429 (9.6%)
Fluid and Electrolyte Disorder 5,783 (25%) 1,640 (37%)
Chronic Kidney Disease 2,909 (13%) 602 (13%)
Atrial Fibrillation 3,913 (17%) 1,181 (26%)
Obstructive Sleep Apnea 3,362 (15%) 664 (15%)
Diabetes Mellitus
    DM-Comp 2,004 (8.7%) 339 (7.6%)
    DM-Non-comp 6,454 (28%) 1,209 (27%)
    None 14,573 (63%) 2,937 (65%)
Cancer
    None 18,747 (81%) 3,851 (86%)
    Non-metastatic 2,912 (13%) 441 (9.8%)
    Metastatic 1,372 (6.0%) 193 (4.3%)
Prior AKI
    None 20,906 (91%) 3,748 (84%)
    Yes-resolved 2,125 (9.2%) 737 (16%)
Vancomycin 4,702 (20%) 1,555 (35%)
Bactrim 713 (3.1%) 66 (1.5%)
Vasopressors 541 (2.3%) 643 (14%)
Other Nephrotoxins 526 (2.3%) 64 (1.4%)
Nephrotoxic Antibiotics 555 (2.4%) 363 (8.1%)
Diuretics 9,193 (40%) 2,298 (51%)
Broad Spectrum Antibiotics 2,717 (12%) 641 (14%)
Narrow Spectrum Antibiotics 9,072 (39%) 2,047 (46%)
WBC x10^9cells/L 8.9 (6.9, 11.4) 10.8 (8.3, 14.0)
Hemoglobin g/dl 11.00 (9.60, 12.40) 11.00 (9.60, 12.40)
Platelets x10^11/L 221 (173, 282) 194 (149, 251)
Sodium, mEq/L 138.0 (135.0, 140.0) 138.0 (136.0, 140.0)
Potassium, mEq/L 4.10 (3.80, 4.40) 4.10 (3.80, 4.40)
Chloride mEq/L 103.0 (101.0, 106.0) 105.0 (102.0, 109.0)
Serum Creatinine at entry 0.90 (0.72, 1.10) 0.90 (0.71, 1.08)
Baseline GFR 72 (57, 88) 75 (59, 89)
¹ n (%); Median (IQR)
# Mutate binary variables to factors, keep only variables we will use
data.c.f <- data %>%
  select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30)) %>%
  filter(bmi>12 & bmi<60) %>%
  mutate(race = str_replace(race, 'Other / Unk', 'Other')) %>%
  mutate(aki = factor(aki, levels = c(0,1), labels = c("No AKI", "AKI"))) %>%
  mutate(cancer = factor(cancer, levels = c(0,1,2), 
                         labels = c("no_cancer", "non-metastatic", "metastatic"))) %>%
   mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
  mutate_at(c("pain", "bp", "sex", "race", "admType", "periOp", "vent", "chf", 
              "cpd", "hiv","liver", "pvd", "cva", "mif", "valve", "htn", "arry",
              "pCirc", "obese", "wtLoss", "fluid", "ckd", "afib", "osa", "dm", 
              "prior_aki", "vanco", "bactrim", "pressor","ntxOther", "abxNTX",
              "diuretic","gramNegBroad", "gramNegNarrow"), factor) %>%
  relocate(where(is.factor))%>%
  relocate(aki)

# Break up the datasets to ICU and Non-ICU cohorts 
icu <- data.c.f %>%
  filter(icu=="ICU") %>%
  select(!c(icu))

nonicu <- data.c.f%>%
  filter(icu=="non-ICU")%>%
  select(!c(icu))

Table 1 shows patient characteristics in the ICU vs non-ICU.

Next, we build the categorical dataset.

# Categorizing the continuous variables using reference ranges
data.c.f.cat = data.c.f %>%
  mutate(age = cut(age, breaks = c(18, 40, 65, Inf), 
                        labels = c("18-40", "41-65", ">65"),
                        include.lowest = T)) %>%
  mutate(bmi = cut(bmi, breaks = c(-Inf, 18.5, 25, 30, Inf), # CDC
                        labels = c("Underweight", "Healthy-weight", 
                                   "Overweight","Obesity"))) %>%
  mutate(priorLos = cut(priorLos, breaks = c(-Inf, 0, 2, 7, Inf),
                            labels = c("0", "1-2", "3-7", ">7"),
                            include.lowest = FALSE)) %>%
  mutate(wbc = cut(wbc, breaks = c(-Inf, 3.5, 9.1, Inf), # Harrison's
                            labels = c("low", "ref", "high"))) %>%
  mutate(hgb = cut(hgb, breaks = c(-Inf,12, 16, Inf), # Harrison's, combo M&F
                        labels = c("low", "ref", "high"))) %>%
  mutate(platelets = cut(platelets, breaks = c(-Inf, 149, 400, Inf),# Upenn
                                    labels = c("low", "ref", "high"))) %>%
  mutate(sodium = cut(sodium, breaks = c(-Inf, 135, 144, Inf),# Upenn
                              labels = c("low", "ref", "high"))) %>%
  mutate(potassium = cut(potassium, breaks = c(-Inf, 3.5, 5.1, Inf),# Upenn
                                    labels = c("low", "ref", "high"))) %>%
  mutate(chloride = cut(chloride, breaks = c(-Inf, 100, 111, Inf),# Upenn
                                  labels = c("low", "ref", "high"))) %>%
  mutate(creatinine = cut(creatinine, breaks = c(-Inf, 0.49, 1.2, Inf), 
                          # Harrison's, combo M&F
                                  labels = c("low", "ref", "high"))) %>%
  mutate(indexGFR = cut(indexGFR, breaks = c(-Inf, 15, 30, 45, 60, 90, Inf),
                            labels = c("G5", "G4", "G3b", "G3a", "G2", "G1")))

# create ICU and non-ICU subsets
icu.cat <- data.c.f.cat %>%
  filter(icu=="ICU") %>%
  select(!c(icu))

nonicu.cat <- data.c.f.cat%>%
  filter(icu=="non-ICU")%>%
  select(!c(icu))

# Table 2, continuous variables categorized
data.c.f.cat %>%
  select(c(age, bmi, priorLos, wbc, hgb, platelets, sodium, potassium, 
           chloride, creatinine, indexGFR, icu)) %>%
  rename( Age = age, BMI = bmi, "Prior Length of Stay" = priorLos, 
          "WBC x10^9cells/L" = wbc, "Hemoglobin g/dl" = hgb, 
          "Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium, 
          "Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride, 
          "Serum Creatinine at entry" = creatinine, "Baseline eGFR" = indexGFR) %>%
  tbl_summary(by = icu)
Characteristic non-ICU, N = 23,031¹ ICU, N = 4,485¹
Age
    18-40 1,029 (4.5%) 256 (5.7%)
    41-65 11,749 (51%) 2,391 (53%)
    >65 10,253 (45%) 1,838 (41%)
BMI
    Underweight 545 (2.4%) 115 (2.6%)
    Healthy-weight 4,544 (20%) 1,017 (23%)
    Overweight 6,640 (29%) 1,404 (31%)
    Obesity 11,302 (49%) 1,949 (43%)
Prior Length of Stay
    0 1,580 (6.9%) 273 (6.1%)
    1-2 14,317 (62%) 2,494 (56%)
    3-7 5,390 (23%) 1,253 (28%)
    >7 1,744 (7.6%) 465 (10%)
WBC x10^9cells/L
    low 364 (1.6%) 13 (0.3%)
    ref 11,437 (50%) 1,454 (32%)
    high 11,230 (49%) 3,018 (67%)
Hemoglobin g/dl
    low 16,087 (70%) 3,067 (68%)
    ref 6,768 (29%) 1,384 (31%)
    high 176 (0.8%) 34 (0.8%)
Platelets x10^11/L
    low 3,153 (14%) 1,127 (25%)
    ref 18,312 (80%) 3,176 (71%)
    high 1,566 (6.8%) 182 (4.1%)
Sodium, mEq/L
    low 5,913 (26%) 1,026 (23%)
    ref 16,814 (73%) 3,247 (72%)
    high 304 (1.3%) 212 (4.7%)
Potassium, mEq/L
    low 2,643 (11%) 485 (11%)
    ref 20,010 (87%) 3,930 (88%)
    high 378 (1.6%) 70 (1.6%)
Chloride mEq/L
    low 5,387 (23%) 761 (17%)
    ref 16,986 (74%) 3,188 (71%)
    high 658 (2.9%) 536 (12%)
Serum Creatinine at entry
    low 446 (1.9%) 148 (3.3%)
    ref 19,138 (83%) 3,736 (83%)
    high 3,447 (15%) 601 (13%)
Baseline eGFR
    G5 0 (0%) 0 (0%)
    G4 308 (1.3%) 35 (0.8%)
    G3b 2,174 (9.4%) 384 (8.6%)
    G3a 4,328 (19%) 766 (17%)
    G2 11,113 (48%) 2,216 (49%)
    G1 5,108 (22%) 1,084 (24%)
¹ n (%)

Table 2 shows categorized continuous variables for patient characteristics in the ICU vs non-ICU.

Now that the datasets are created, we can build the models and identify the variables that have the best predictive value. We first evaluate the complete, ICU, and non-ICU datasets for both logistic regression and random forest with continuous variables. We then repeat the analysis using categorical variables only.

Results

Visualizing Data

From Table 1, we can see that AKI rates are higher in the ICU. Next, we visualize the relationship between ICU stay and AKI.

ggplot(data = data.c.f, aes(x=icu, fill = aki)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of study patients with AKI by location", 
       x = "ICU Status", y = "Proportion")

Figure 1 Proportion of study patients with AKI in the ICU vs non-ICU.

Using a Chi-square test, we can check if there is a statistically significant difference in AKI rates.

# Chi-Square test
chisq.test(table(data.c.f$aki, data.c.f$icu))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data.c.f$aki, data.c.f$icu)
## X-squared = 112.12, df = 1, p-value < 2.2e-16

At the \(\alpha = 0.05\) level, we can conclude that in our study population, AKI rates differ between the non-ICU and ICU settings. This is helpful contextual information for our data, though it is not part of the main analysis.

Part 1: Analysis with Continuous Variables Included as Continuous

Logistic Regression and Random Forest Models

Next, we create logistic regression and random forest models for all three data subsets (complete, ICU, and non-ICU), then compare the top predictors.

Model Building for Complete Dataset

N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(data.c.f, s != i)
    test <- filter(data.c.f, s == i)
    obs.outputs[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions <- predict(glm.aki, test, type = "response")
  pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
  
    #RF train/test
    rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr <- predict(rf, newdata = test, type = "prob") 
    pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
    
    offset <- offset + length(s[s == i])
}
# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
## 
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6924
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf,     ci = TRUE)
## 
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6853
## 95% CI: 0.6743-0.6963 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy",  add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 2 ROC Curve for the Complete Dataset with Continuous Data

# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"], 
               pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)

Figure 3 PR Curve for the Logistic Regression Model for the Complete Dataset with Continuous Data

prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"], 
               pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)

Figure 4 PR Curve for the Random Forest Model for the Complete Dataset with Continuous Data

Model Building for ICU Dataset

N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(icu, s != i)
    test <- filter(icu, s == i)
    obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
  pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
  
    #RF train/test
    rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob") 
    pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
## 
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6651
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu,     ci = TRUE)
## 
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6481
## 95% CI: 0.6244-0.6718 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 5 ROC Curves for the ICU Dataset with Continuous Data

# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"], 
               pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)

Figure 6 PR Curve for the Logistic Regression Model for the ICU Dataset with Continuous Data

prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"], 
               pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)

Figure 7 PR Curve for the Random Forest Model for the ICU Dataset with Continuous Data

Model Building for non-ICU Dataset

N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(nonicu, s != i)
    test <- filter(nonicu, s == i)
    obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
  pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
  
    #RF train/test
    rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob") 
    pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
## 
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.687
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu,     ci = TRUE)
## 
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6765
## 95% CI: 0.6638-0.6893 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 8 ROC Curves for the non-ICU Dataset with Continuous Data

# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)

Figure 9 PR Curve for the Logistic Regression Model for the non-ICU Dataset with Continuous Data

prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)

Figure 10 PR Curve for the Random Forest Model for the non-ICU Dataset with Continuous Data

ROC performance decreased when the data were split into ICU and non-ICU cohorts compared with the combined dataset. However, PRC performance increased for the ICU group. But were the top predictors different?

Identifying Top Predictors

First, identify top predictors for the complete dataset.

# Identify top predictors for the combined data set
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z 
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 11 Relative Importance in the Logistic Regression Model for the Complete Dataset with Continuous Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals <- data.frame(p = glm.aki.sum$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals <- glm.aki.pvals %>%
  filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 12 Relative Importance in the Random Forest Model for the Complete Dataset with Continuous Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)
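
Beyond eyeballing the two bar charts, the agreement between the GLM and RF rankings can be summarized with a rank correlation. A toy sketch (hypothetical importance values standing in for glm.aki.z and gini from above):

```r
# Toy sketch: rank agreement between GLM (|z|) and RF (Gini) importance.
# The feature names and values here are illustrative, not the study's.
glm.imp <- data.frame(feature = c("age", "bmi", "ckd", "chf"),
                      z       = c(5.1, 1.2, 6.3, 2.8))
rf.imp  <- data.frame(feature = c("age", "bmi", "ckd", "chf"),
                      MeanDecreaseGini = c(80, 120, 40, 25))

both <- merge(glm.imp, rf.imp, by = "feature")
# Spearman correlation of the two rankings
cor(abs(both$z), both$MeanDecreaseGini, method = "spearman") # -0.4 for these toy values
```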

Repeat the process for the ICU subset.

# Identify top predictors for the ICU subset
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z 
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 13 Relative Importance in the Logistic Regression Model for the ICU Dataset with Continuous Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals.icu <- data.frame(p = glm.aki.sum.icu$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
  filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 14 Relative Importance in the Random Forest Model for the ICU Dataset with Continuous Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat the process for the non-ICU subset.

# Identify top predictors for the non-ICU subset
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z 
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 15 Relative Importance in the Logistic Regression Model for the non-ICU Dataset with Continuous Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals.nonicu <- data.frame(p = glm.aki.sum.nonicu$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
  filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 16 Relative Importance in the Random Forest Model for the non-ICU Dataset with Continuous Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Make a table that compares the top predictors across the different methods and datasets.

# Make a dataframe with all the results

top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu),  rownames(top.glm.aki.pvals.icu)) %>%
  rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval
##     combined       non-ICU             ICU
## 1  diuretic1        fluid1         abxNTX1
## 2     fluid1     diuretic1            ckd1
## 3       ckd1          ckd1             bmi
## 4   bactrim1      bactrim1       ntxOther1
## 5        age           age             age
## 6     icuICU          chf1       diuretic1
## 7    abxNTX1       wtLoss1            chf1
## 8       chf1    creatinine        priorLos
## 9        bmi periOpPOD two           afib1
## 10      pvd1        liver1 periOpPOD three

Table 3 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets

top.compare.gini <-data.frame(top.rf.gini$feature,  top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini
##    top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1             indexGFR                   indexGFR                indexGFR
## 2                  bmi                        bmi              creatinine
## 3           creatinine                 creatinine                     bmi
## 4            platelets                  platelets               platelets
## 5                  wbc                        wbc               potassium
## 6                  hgb                        hgb                     age
## 7                  age                        age                     wbc
## 8             chloride                   chloride                     hgb
## 9            potassium                  potassium                chloride
## 10              sodium                     sodium                  sodium

Table 4 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets

Discuss the results.
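
One way to make the discussion concrete is to count how many features the top-10 logistic regression lists share; the feature names below are copied from Table 3:

```r
# Pairwise overlap between the GLM top-10 lists (features from Table 3)
top.combined <- c("diuretic1", "fluid1", "ckd1", "bactrim1", "age",
                  "icuICU", "abxNTX1", "chf1", "bmi", "pvd1")
top.nonicu   <- c("fluid1", "diuretic1", "ckd1", "bactrim1", "age",
                  "chf1", "wtLoss1", "creatinine", "periOpPOD two", "liver1")
top.icu      <- c("abxNTX1", "ckd1", "bmi", "ntxOther1", "age",
                  "diuretic1", "chf1", "priorLos", "afib1", "periOpPOD three")

length(intersect(top.combined, top.nonicu)) # 6 shared features
length(intersect(top.combined, top.icu))    # 6 shared features
length(intersect(top.nonicu, top.icu))      # 4 shared features
```

By contrast, Table 4 shows the three random forest top-10 lists contain the same ten features, differing only in order.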

Part 2: Analysis with Continuous Variables Categorized

Logistic Regression and Random Forest Models

Model Building for Complete Dataset

Analyze the continuous variables as categorized data.
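
The categorization step itself is not shown in this chunk; a hypothetical sketch with cut() illustrates the idea (the break points and labels here are illustrative, not the study's actual cutpoints, which follow the CDC and Harrison's references):

```r
# Hypothetical sketch: binning a continuous lab value into labelled
# categories with cut(). Breaks and labels are illustrative only.
creatinine <- c(0.4, 0.9, 1.4, 2.6)  # example values, mg/dL

creatinine.cat <- cut(creatinine,
                      breaks = c(-Inf, 0.5, 1.2, Inf),
                      labels = c("low", "ref", "high"))
creatinine.cat # low ref high high
```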

N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(data.c.f.cat, s != i)
    test <- filter(data.c.f.cat, s == i)
    obs.outputs[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions <- predict(glm.aki, test, type = "response")
  pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
  
    #RF train/test
    rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr <- predict(rf, newdata = test, type = "prob") 
    pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
    
    offset <- offset + length(s[s == i])
}
# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
## 
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.7051
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf,     ci = TRUE)
## 
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6808
## 95% CI: 0.6699-0.6917 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 17 ROC Curves for the Complete Dataset with Categorical Data

# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"], 
               pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)

Figure 18 PR Curves for the Logistic Regression on the Complete Dataset with Categorical Data

prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"], 
               pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)

Figure 19 PR Curves for the Random Forest Model on the Complete Dataset with Categorical Data

Model Building for ICU Dataset

N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(icu.cat, s != i)
    test <- filter(icu.cat, s == i)
    obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
  pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
  
    #RF train/test
    rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob") 
    pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
## 
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6838
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu,     ci = TRUE)
## 
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6737
## 95% CI: 0.6512-0.6961 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 20 ROC Curves for the ICU Dataset with Categorical Data

# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"], 
               pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)

Figure 21 PR Curves for the Logistic Regression Model on the ICU Dataset with Categorical Data

prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"], 
               pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)

Figure 22 PR Curves for the Random Forest Model on the ICU Dataset with Categorical Data

Model Building for non-ICU Dataset

N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(nonicu.cat, s != i)
    test <- filter(nonicu.cat, s == i)
    obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
  pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
  
    #RF train/test
    rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob") 
    pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
## 
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.698
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu,     ci = TRUE)
## 
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6723
## 95% CI: 0.6597-0.6849 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 23 ROC Curves for the non-ICU Dataset with Categorical Data

# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)

Figure 24 PR Curve for the Logistic Regression Model on the non-ICU Dataset with Categorical Data

prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)

Figure 25 PR Curve for the Random Forest Model on the non-ICU Dataset with Categorical Data
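
The same 10-fold cross-validation loop is written out three times above. As a refactoring sketch (hypothetical helper name, GLM part only, shown on toy data), the loop can be factored into a function; indexing with s == i also keeps predictions in the original row order without an offset counter:

```r
# Hypothetical helper factoring out the repeated K-fold CV pattern (GLM only)
cv_glm_probs <- function(df, K = 10, seed = 1234) {
  set.seed(seed)
  s <- sample(1:K, size = nrow(df), replace = TRUE)  # fold assignments
  pred <- numeric(nrow(df))
  for (i in 1:K) {
    fit <- glm(aki ~ ., data = df[s != i, ], family = binomial(logit))
    pred[s == i] <- predict(fit, df[s == i, ], type = "response")
  }
  data.frame(obs = df$aki, pred = pred)  # rows stay in original order
}

# Toy data with the column name the formula assumes
set.seed(1)
toy <- data.frame(aki = factor(rbinom(200, 1, 0.3)),
                  x1 = rnorm(200), x2 = rnorm(200))
out <- cv_glm_probs(toy, K = 5)
```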

Identifying Top Predictors

# Identify top predictors for the combined data set with categorized variables
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z 
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 26 Relative Importance in the Logistic Regression Model for the Complete Dataset with Categorical Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals <- data.frame(p = glm.aki.sum$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals <- glm.aki.pvals %>%
  filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 27 Relative Importance in the Random Forest Model for the Complete Dataset with Categorical Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat the process for the ICU subset with categorized data.

# Identify top predictors for the ICU subset
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z 
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 28 Relative Importance in the Logistic Regression Model for the ICU Dataset with Categorical Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals.icu <- data.frame(p = glm.aki.sum.icu$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
  filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 29 Relative Importance in the Random Forest Model for the ICU Dataset with Categorical Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat the process for the non-ICU subset with categorized data.

# Identify top predictors for the non-ICU subset
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z 
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
  labs(y="Relative Importance (z-score)", x = "Feature")

Figure 30 Relative Importance in the Logistic Regression Model for the non-ICU Dataset with Categorical Data

# Get 10 Top Predictors by GLM (smallest p-values)
glm.aki.pvals.nonicu <- data.frame(p = glm.aki.sum.nonicu$coefficients[,4]) # name the column instead of relying on the auto-generated name
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
  filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
  arrange(p) %>%
  slice_head(n=10)
# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")

Figure 31 Relative Importance in the Random Forest Model for the non-ICU Dataset with Categorical Data

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Make a table that compares the top predictors across the different methods and datasets using the categorized data.

# Make a dataframe with all the results
top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu),  rownames(top.glm.aki.pvals.icu)) %>%
  rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval
##          combined        non-ICU            ICU
## 1   creatinineref  creatinineref  creatinineref
## 2  creatininehigh         fluid1 creatininehigh
## 3       diuretic1      diuretic1        abxNTX1
## 4        bactrim1 creatininehigh      diuretic1
## 5          fluid1       bactrim1           ckd1
## 6      priorLos>7           ckd1     priorLos>7
## 7            ckd1           chf1      ntxOther1
## 8          icuICU     priorLos>7           chf1
## 9     priorLos3-7      raceWhite     indexGFRG2
## 10        abxNTX1  periOpPOD two     indexGFRG1

Table 5 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets using Categorical Data

top.compare.gini <-data.frame(top.rf.gini$feature,  top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini
##    top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1             indexGFR                        bmi                indexGFR
## 2                  bmi                   indexGFR                     bmi
## 3             priorLos                   priorLos                priorLos
## 4                 race                       race                  periOp
## 5                   dm                         dm                    race
## 6                  age                        age                      dm
## 7             chloride                        wbc                chloride
## 8               periOp                   chloride                     age
## 9            platelets                  platelets               platelets
## 10                 wbc                     sodium                  sodium

Table 6 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets Using Categorical Data

Discussion
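
As a starting point for the discussion, the AUCs printed by roc() for the categorical-data models can be gathered into one frame (values copied from the outputs above):

```r
# Categorical-data AUCs, copied from the roc() outputs in Part 2
auc.cat <- data.frame(
  dataset = c("combined", "ICU", "non-ICU"),
  glm.auc = c(0.7051, 0.6838, 0.6980),
  rf.auc  = c(0.6808, 0.6737, 0.6723)
)
auc.cat
```

In every subset the logistic regression edged out the random forest on AUC with the categorized variables.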

References

  1. Goyal A, Daneshpajouhnejad P, Hashmi MF, et al. Acute Kidney Injury. [Updated 2022 Aug 18]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK441896/.

  2. De Vlieger G, Kashani K, Meyfroidt G. Artificial intelligence to guide management of acute kidney injury in the ICU: a narrative review. Curr Opin Crit Care. 2020 Dec;26(6):563-573. doi: 10.1097/MCC.0000000000000775.

  3. Miano TA, et al. Effect of renin-angiotensin system inhibitors on the comparative nephrotoxicity of NSAIDs and opioids during hospitalization. Kidney360. 2020;1(7):604.

  4. BMI cutpoints: Centers for Disease Control and Prevention. https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html#InterpretedAdults

  5. Kratz A, Pesce MA, Basner RC, Einstein AJ. Laboratory Values of Clinical Importance. In: Kasper D, Fauci A, Hauser S, Longo D, Jameson J, Loscalzo J, eds. Harrison's Principles of Internal Medicine, 19e. McGraw Hill; 2014. Accessed November 23, 2022. https://accessmedicine-mhmedical-com.proxy.library.upenn.edu/content.aspx?bookid=1130&sectionid=79722706

  6. Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics. 2018 Nov 1;34(21):3711-3718. doi: 10.1093/bioinformatics/bty373. PMID: 29757357; PMCID: PMC6198850.

  7. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007 Jan 25;8:25. doi: 10.1186/1471-2105-8-25. PMID: 17254353; PMCID: PMC1796903.